Remark

Please be aware that these lecture notes are accessible online in an ‘early access’ format. They are actively being developed, and certain sections will be further enriched to provide a comprehensive understanding of the subject matter.

4.6. Major Statistical Methods for Outlier Detection in Time Series#

4.6.1. Descriptive Statistics-Based Methods#

4.6.1.1. Z-Score Method#

Purpose: Measure how many standard deviations a point is from the mean.

Formula:

(4.10)#\[\begin{equation} Z = \frac{x - \mu}{\sigma} \end{equation}\]

Outlier Criterion: \(|Z| > 3\)
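Under an exactly normal distribution, the \(|Z| > 3\) rule flags only about 0.27% of points by chance; this tail probability can be checked directly (a quick sketch using scipy, separate from the worked example below):

```python
from scipy.stats import norm

# Two-sided tail probability beyond |Z| = 3 under a standard normal
p = 2 * norm.sf(3)
print(f"{p:.4%}")  # roughly 0.27% of points exceed the threshold by chance
```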

Example 4.5

Daily temperatures with mean 72°F and standard deviation 8°F.

  • Reading of 98°F: \(Z = \frac{98-72}{8} = 3.25\) → Outlier

  • Reading of 85°F: \(Z = \frac{85-72}{8} = 1.625\) → Normal

Limitations
  • Sensitive to existing outliers (outliers inflate \(\mu\) and \(\sigma\))

  • Assumes normal distribution
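The first limitation, sometimes called masking, is easy to demonstrate: a single extreme value inflates both \(\mu\) and \(\sigma\) enough to keep its own z-score below 3 (a small illustrative sketch with made-up numbers, not part of the lecture data):

```python
import numpy as np

data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 100.0])  # one obvious outlier
z = (data - data.mean()) / data.std(ddof=1)

# The extreme value drags the mean up to 25 and the std up to ~36.7,
# so its own z-score is only ~2.04 and it escapes the |Z| > 3 rule
print(z[-1])
```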

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1515)

# Simulate 180 days of temperatures (~mean 72F, sd ~8F) with mild seasonality
idx = pd.date_range('2024-04-01', periods=180, freq='D')
mean_true, sd_true = 72, 8
season = 4 * np.sin(2*np.pi*(idx.dayofyear)/365)
noise = np.random.normal(0, sd_true, size=len(idx))
temps = np.array(mean_true + season + noise, dtype=float)

# Inject extremes to trigger |Z| > 3
temps[50] += 26   # hot outlier
temps[120] -= 28  # cold outlier

# Calculate Z-scores
mu = temps.mean()
sigma = temps.std(ddof=1)
z_scores = (temps - mu) / sigma
is_outlier = np.abs(z_scores) > 3

# Save dataset
z_df = pd.DataFrame({'date': idx, 'temp_F': temps, 'z_score': z_scores, 'is_outlier': is_outlier})
```

```python
import itables
import pandas as pd

itables.init_notebook_mode(all_interactive=False)
# itables_dict: shared itables display options assumed to be defined earlier in these notes
with pd.option_context("display.float_format", "{:,.4f}".format):
    itables.show(z_df.rename(columns={'date': 'Date', 'temp_F': 'Temperature (°F)', 'z_score': 'Z-Score', 'is_outlier': 'Is Outlier'}), **itables_dict)
```
../_images/zscore_method_illustration.png

Fig. 4.23 Z-Score method applied to daily temperatures. Top panel shows the time series with detected outliers marked in red. Bottom panel shows z-score values, with points beyond ±3σ flagged.#

4.6.1.2. Modified Z-Score Method#

Purpose: Robust alternative using median and MAD (Median Absolute Deviation).

Formula:

(4.11)#\[\begin{equation} M_i = \frac{0.6745(x_i - \tilde{x})}{\text{MAD}} \end{equation}\]

where \(\tilde{x}\) = median, \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\)

Outlier Criterion: \(|M| > 3.5\)

Example 4.6

Monthly sales with median $50,000 and MAD $8,000.

  • Sales of $85,000: \(M = \frac{0.6745(85000-50000)}{8000} = 2.95\) → Normal

  • Sales of $120,000: \(M = \frac{0.6745(120000-50000)}{8000} = 5.90\) → Outlier

Advantages
  • Resistant to extreme values

  • Better for skewed distributions
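Because the median and MAD are barely moved by a single extreme value, the modified z-score catches outliers that inflate, and thereby mask, the ordinary z-score (a small illustrative sketch with made-up numbers):

```python
import numpy as np

data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 100.0])  # one obvious outlier
median = np.median(data)                # 10.25 -- barely affected by the outlier
mad = np.median(np.abs(data - median))  # 0.75  -- barely affected by the outlier
m = 0.6745 * (data - median) / mad

# The extreme value gets a modified z-score of ~80, far beyond the 3.5 cutoff,
# while every other point stays well inside it
print(m[-1])
```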

```python
# Generate synthetic sales data with skew
np.random.seed(1717)
idx = pd.date_range('2020-01-01', '2024-12-31', freq='MS')
months = idx.month
n = len(idx)

# Seasonal pattern with Q4 uplift
q = ((months-1)//3 + 1)
season_mult = np.choose(q-1, [1.00, 1.05, 1.10, 1.45])
trend = 1.02 ** np.arange(n)
lognorm = np.random.lognormal(mean=0.0, sigma=0.20, size=n)
sales = 50000 * season_mult * trend * lognorm + np.random.normal(0, 3000, size=n)
sales = np.maximum(30000, sales)  # floor sales at 30,000
sales = np.array(sales, dtype=float)

# Inject outliers
sales[47] *= 2.2  # Dec 2023
sales[15] *= 0.5  # Apr 2021

# Modified Z-score
median = np.median(sales)
mad = np.median(np.abs(sales - median))
mad = max(mad, 1e-6)  # guard against a zero MAD
modified_z = 0.6745 * (sales - median) / mad
is_outlier = np.abs(modified_z) > 3.5

sales_df = pd.DataFrame({'month': idx,
                         'sales': sales,
                         'modified_z': modified_z,
                         'is_outlier': is_outlier
                         })
```

```python
import itables
import pandas as pd

itables.init_notebook_mode(all_interactive=False)
# itables_dict: shared itables display options assumed to be defined earlier in these notes
with pd.option_context("display.float_format", "{:,.4f}".format):
    itables.show(sales_df.rename(columns={'month': 'Month', 'sales': 'Sales ($)', 'modified_z': 'Modified Z-Score', 'is_outlier': 'Is Outlier'}), **itables_dict)
```
../_images/modified_zscore_illustration.png

Fig. 4.24 Modified Z-Score method applied to monthly sales data with seasonal patterns and skewness. This robust method uses median and MAD, making it less sensitive to existing outliers.#

4.6.1.3. Interquartile Range (IQR) Method#

Purpose: Range-based detection using the middle 50% of data.

Method:

  • Calculate \(IQR = Q3 - Q1\)

  • Define bounds: \([Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]\)

Outlier Criterion: Points outside the bounds

Example 4.7

Hourly website traffic with Q1 = 1,200 and Q3 = 1,800 visitors.

  • \(IQR = 1,800 - 1,200 = 600\)

  • Lower bound: \(1,200 - 1.5(600) = 300\)

  • Upper bound: \(1,800 + 1.5(600) = 2,700\)

  • Traffic of 3,500 → Outlier

Advantages
  • Very robust against outliers

  • Works well with skewed distributions

  • Foundation of box plot outlier detection
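The box-plot connection is direct: matplotlib's default whiskers extend 1.5 × IQR beyond the quartiles, so the "fliers" it draws are exactly the points this method flags (a small sketch; the traffic numbers here are invented for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
traffic = np.concatenate([rng.normal(1500, 150, 200), [3500.0]])

fig, ax = plt.subplots()
bp = ax.boxplot(traffic, whis=1.5)    # whiskers at Q1 - 1.5*IQR and Q3 + 1.5*IQR
fliers = bp["fliers"][0].get_ydata()  # points drawn beyond the whiskers
print(3500.0 in fliers)               # the injected spike is flagged
```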

```python
# Generate hourly traffic data
np.random.seed(1818)
hours = 30 * 24
idx = pd.date_range('2024-05-01', periods=hours, freq='h')
hour = idx.hour
weekday = idx.dayofweek

# Diurnal pattern with weekday effect
base = 1400 + 500*np.sin(2*np.pi*(hour-16)/24)
base += 150 * (weekday < 5)
traffic = base + np.random.normal(0, 120, size=hours)
traffic = np.maximum(50, traffic)
traffic = np.array(traffic, dtype=float)

# Inject outliers
traffic[200] += 1800
traffic[350] -= 900

# IQR method
Q1 = np.percentile(traffic, 25)
Q3 = np.percentile(traffic, 75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
is_outlier = (traffic < lower) | (traffic > upper)

df_traffic = pd.DataFrame({'timestamp': idx,
                           'traffic': traffic,
                           'is_outlier': is_outlier
                           })
```
../_images/iqr_method_illustration.png

Fig. 4.25 IQR method applied to hourly website traffic. Top panel shows time series with outlier bounds derived from Q1 and Q3. Bottom panel groups data by hour, revealing context-specific outliers.#

4.6.2. Time Series Modeling-Based Methods#

4.6.2.1. ARIMA-Based Detection#

Purpose: Model normal time series behavior with ARIMA using historical data, then detect outliers by comparing predictions against actual observations.

Method:

  1. Split data into training (historical) and testing (recent) periods

  2. Fit ARIMA(p,d,q) or SARIMA model on training data

  3. Generate forecasts for testing period

  4. Calculate forecast errors: \(\text{Error}_t = \text{Actual}_t - \text{Forecast}_t\)

  5. Flag outliers where \(|\text{Error}| > k\sigma_{\text{error}}\) (typically \(k=3\))

Outlier Types Detected:

  • Additive Outliers (AO): Single-point spikes that return to normal

  • Level Shifts (LS): Permanent level changes after a specific time

  • Temporary Changes (TC): Gradual decay patterns with persistence parameter \(\delta\)
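These three patterns are commonly formalized as intervention models (following the Chen–Liu framework; the notation below is one standard convention, with \(x_t\) the outlier-free series, \(I_t(t_0)\) a pulse indicator at time \(t_0\), and \(S_t(t_0)\) a step that equals 1 for \(t \ge t_0\)):

\[\begin{aligned} \text{AO:}\quad & y_t = x_t + \omega\, I_t(t_0) \\ \text{LS:}\quad & y_t = x_t + \omega\, S_t(t_0) \\ \text{TC:}\quad & y_t = x_t + \omega\, \delta^{\,t - t_0}\, S_t(t_0), \qquad 0 < \delta < 1 \end{aligned}\]

Here \(\omega\) is the outlier magnitude; note that a TC equals an AO when \(\delta = 0\) and approaches an LS as \(\delta \to 1\).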

Note

ARIMA modeling is covered in detail in Chapter XX. This section focuses on using fitted ARIMA models for outlier detection through forecast comparison.

Example 4.8

Monthly airline passengers (2010–2022)

  • Training: 2010–2019 (120 months) to capture normal patterns

  • Testing: 2020–2022 (36 months) to detect anomalies

  • Model: SARIMA(0,1,1)(0,1,1)[12] captures trend and seasonality

  • Normal forecast error: ±150 passengers

  • March 2020 forward: Errors exceed −2000 passengers → Outliers (COVID-19 impact)

Advantages
  • Realistic out-of-sample evaluation

  • Accounts for autocorrelation, trend, and seasonality

  • Clear separation between normal variation and true anomalies

  • Can classify outlier types through error pattern analysis

Limitations
  • Requires sufficient historical data (minimum 2–3 seasonal cycles)

  • Model misspecification can produce false positives

  • Assumes future follows historical patterns (breaks during regime changes)

  • Forecast uncertainty increases with horizon length

  • Not suitable for data with frequent structural breaks

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from scipy import stats

np.random.seed(2323)

# Generate synthetic monthly passenger data (2010-2022: 156 months)
idx = pd.date_range('2010-01-01', '2022-12-31', freq='MS')
n = len(idx)
months = idx.month
years = idx.year - 2010

# Synthetic components
trend = 8000 + 120 * np.arange(n)  # Linear growth
seasonal = 1200 * np.sin(2*np.pi*(months - 6)/12)  # Summer peaks
noise = np.random.normal(0, 150, size=n)
passengers = trend + seasonal + noise
passengers = np.maximum(5000, passengers)
passengers = np.array(passengers, dtype=int)

# Inject synthetic outlier patterns
passengers_with_anomalies = passengers.copy()

# AO: Single spike in July 2018 (special event like Olympics)
idx_ao = np.where((years == 8) & (months == 7))[0][0]
passengers_with_anomalies[idx_ao] += 1500

# LS: Permanent capacity increase in Jan 2017 (new terminal)
idx_ls = np.where((years == 7) & (months == 1))[0][0]
passengers_with_anomalies[idx_ls:] += 800

# TC: COVID-19 impact starting March 2020 with gradual recovery
idx_tc = np.where((years == 10) & (months == 3))[0][0]
impact_base = -5000
decay_factor = 0.92  # Slower recovery
for i, t in enumerate(range(idx_tc, n)):
    passengers_with_anomalies[t] += impact_base * (decay_factor ** i)

# Split into training (2010-2019) and testing (2020-2022)
train_end = '2019-12-31'
train_mask = idx <= train_end
test_mask = idx > train_end

train_data = passengers_with_anomalies[train_mask]
test_data = passengers_with_anomalies[test_mask]
train_idx = idx[train_mask]
test_idx = idx[test_mask]

# Fit SARIMA on training data only
model = SARIMAX(
    train_data,
    order=(0, 1, 1),
    seasonal_order=(0, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False
)
result = model.fit(disp=False)

# Generate forecasts for test period
n_forecast = len(test_data)
forecast = result.forecast(steps=n_forecast)
forecast_values = forecast.values if hasattr(forecast, 'values') else forecast

# Calculate forecast errors
forecast_errors = test_data - forecast_values
error_std = forecast_errors.std(ddof=1)
threshold = 3 * error_std
is_outlier_test = np.abs(forecast_errors) > threshold

# Also get in-sample residuals for comparison
train_residuals = result.resid
train_resid_std = train_residuals.std(ddof=1)
train_threshold = 3 * train_resid_std
is_outlier_train = np.abs(train_residuals) > train_threshold

# Combine for full series
full_residuals = np.concatenate([train_residuals, forecast_errors])
full_is_outlier = np.concatenate([is_outlier_train, is_outlier_test])
full_fitted = np.concatenate([result.fittedvalues, forecast_values])

# Create comprehensive dataframe
df_passengers = pd.DataFrame({
    'month': idx,
    'actual': passengers_with_anomalies,
    'fitted_forecast': full_fitted,
    'error': full_residuals,
    'is_outlier': full_is_outlier,
    'period': ['Train' if m <= pd.Timestamp(train_end) else 'Test' for m in idx]
})

# Summary statistics
summary = {
    'model': 'SARIMA(0,1,1)(0,1,1)[12]',
    'train_period': f"{train_idx[0].strftime('%Y-%m')} to {train_idx[-1].strftime('%Y-%m')}",
    'test_period': f"{test_idx[0].strftime('%Y-%m')} to {test_idx[-1].strftime('%Y-%m')}",
    'train_months': len(train_data),
    'test_months': len(test_data),
    'train_resid_std': round(train_resid_std, 2),
    'test_error_std': round(error_std, 2),
    'train_outliers': int(is_outlier_train.sum()),
    'test_outliers': int(is_outlier_test.sum()),
    'outlier_dates_test': test_idx[is_outlier_test].strftime('%Y-%m').tolist()[:5]  # First 5
}
```
Summary of the fitted model and detection results:

  • model: SARIMA(0,1,1)(0,1,1)[12]
  • train_period: 2010-01 to 2019-12
  • test_period: 2020-01 to 2022-12
  • train_months: 120
  • test_months: 36
  • train_resid_std: 887.25
  • test_error_std: 1318.63
  • train_outliers: 2
  • test_outliers: 3
  • outlier_dates_test: 2020-03, 2020-04, 2020-05

The ARIMA model fits the training period well, but forecast errors in 2020–2022 become much larger, especially right after early 2020, which signals a structural break that the model was not built to anticipate. The first few months of the pandemic show large negative errors (actual far below forecast), then errors gradually move back toward zero as demand recovers.

The very large, short‑lived drops behave like a temporary change (TC) rather than a permanent level shift, because the series slowly returns toward its pre‑COVID trajectory. Earlier anomalies in the training period would appear as isolated spikes (AO) or step changes (LS), but in this example the main emphasis is on the regime change in the test window.

When this setup works well

  • There is a reasonably stable historical pattern (trend + seasonality) to learn from.

  • The goal is to spot periods where reality departs sharply from “business as usual” forecasts.

  • You can afford to periodically refit the model as new data arrive.

Main caveats

  • ARIMA assumes the future looks statistically similar to the past; large regime shifts (pandemics, policy shocks) break this assumption, so very large errors may be better interpreted as regime change than isolated outliers.

  • Forecast uncertainty grows with horizon, so very long test windows and a fixed ±3σ rule can either miss relevant anomalies or over‑flag normal volatility.

  • The 3σ threshold is heuristic; in practice it should be tuned using domain knowledge and tolerance for false alarms.

../_images/arima_outlier_detection_train_test.png

Fig. 4.26 Out-of-sample ARIMA-based outlier detection using a train–test split. Panel (a) shows training data plus test-period actual vs forecast and flagged test outliers; panel (b) shows corresponding forecast errors with a ±3σ band.#

4.6.2.2. STL Decomposition#

Purpose: Separate a time series into trend, seasonal, and residual components; detect outliers in the residuals.

Method:

  1. Decompose: Series = Trend + Seasonal + Residual

  2. Apply outlier detection to residual component only (e.g., \(|\text{Residual}| > 3\sigma\))

  3. Large residuals indicate values that the trend and seasonality cannot explain

Outlier Criterion: \(|\text{Residual}| > 3\sigma_{\text{residual}}\)

Example 4.9

Monthly ice cream sales (2019–2024)

  • Strong summer peaks (seasonality) and gradual growth (trend)

  • January 2023: residual = +15,000 units (all others ±5,000) → Outlier (unusual warm weather?)

Advantages
  • Handles complex, non-linear seasonality

  • Robust decomposition (outliers don’t distort trend/seasonal estimates)

  • Clear visual separation of components


After removing the trend and seasonal components, any remaining spike in the residuals is something the model cannot explain: that is your outlier. The January 2023 value is not extreme for the series as a whole (it falls within the range of typical summer months), but the decomposition reveals it is unusually high for January, flagging it as an anomaly.

When to use this

  • Data has clear, regular seasonality (monthly, daily, hourly patterns).

  • You want to isolate structure (trend + season) separately from anomalies.

  • You can fit the series robustly (STL’s robust=True handles outliers well).

Main limitation

  • If the model cannot fit the trend or seasonality well (e.g., from bad parameter choices), residuals become noisy and threshold-based detection becomes unreliable.

../_images/stl_decomposition_outliers.png

Fig. 4.27 STL decomposition of monthly ice cream sales. Panels show observed data, trend, seasonal pattern, and residuals. Outliers appear as spikes in the residual panel beyond ±3σ bands.#

4.6.2.3. Change Point Detection#

Purpose: Identify sudden shifts in the statistical properties of a time series (mean, variance, or trend).

Method: Use CUSUM (Cumulative Sum Control Chart) to detect when a series mean shifts. Flag points near the detected change as potential anomalies.

Example 4.10

Server response times

  • Baseline: 200 ms with low variance (stable operation)

  • Day 45 onward: mean jumps to ~260 ms, variance increases

  • Days 44–46 show spikes >1000 ms → system-level issue, not isolated outliers

Large errors clustered around a change point often indicate a regime shift rather than scattered anomalies.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(2525)

# Generate daily response times (90 days)
n = 90
idx = pd.date_range('2024-01-01', periods=n, freq='D')

# Baseline: ~200 ms
response_times = np.random.normal(200, 20, size=n)

# Abrupt shift at day 45: +60 ms, higher variance
shift_day = 45
response_times[shift_day:] += 60
response_times[shift_day:] += np.random.normal(0, 25, size=n-shift_day)

# Add spikes around the change (system struggling)
spike_days = [44, 45, 46]
for d in spike_days:
    response_times[d] += 800

# Simple CUSUM: detect mean shift
baseline_mean = response_times[:30].mean()
baseline_std = response_times[:30].std(ddof=1)

# CUSUM parameters
K = 0.5 * baseline_std
H = 5 * baseline_std
cusum = np.zeros(n)

for t in range(1, n):
    cusum[t] = max(0, cusum[t-1] + (response_times[t] - (baseline_mean + K)))

# Find alarm point
alarm_idx = int(np.argmax(cusum > H)) if (cusum > H).any() else None
```
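The CUSUM above accumulates only upward deviations, so it would miss a drop in response times. A symmetric two-sided variant tracks both directions; a small self-contained sketch (the target, reference value `k`, and threshold `h` below are hypothetical tuning choices):

```python
import numpy as np

def cusum_two_sided(x, target, k, h):
    """Return the index of the first alarm for a mean shift in either direction, or None."""
    c_hi = c_lo = 0.0
    for t, xt in enumerate(x):
        c_hi = max(0.0, c_hi + (xt - target - k))  # accumulates sustained upward drift
        c_lo = max(0.0, c_lo + (target - xt - k))  # accumulates sustained downward drift
        if c_hi > h or c_lo > h:
            return t
    return None

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(200, 20, 45),    # stable baseline
                    rng.normal(260, 20, 45)])   # mean shift of +60 ms at index 45
alarm = cusum_two_sided(x, target=200.0, k=10.0, h=150.0)
print(alarm)  # alarm fires shortly after the shift
```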

The CUSUM chart shows the cumulative deviation from a baseline level. When it crosses the threshold (red line), it signals a sustained shift in the mean—not an isolated spike, but a regime change. The large response times around day 45 are symptoms of the underlying system issue, not separate outliers to flag independently.

When to use this

  • Monitoring systems that can undergo sudden shifts (operational changes, hardware failures, traffic surges).

  • You want to distinguish between normal variability and structural breaks.

  • Interest is in identifying when things changed, not just individual anomalies.

Limitation

  • CUSUM works well for detecting sustained mean shifts but not sudden changes in variance or multiple complex breaks. For those, use a dedicated change point library (e.g., ruptures, which implements PELT and binary segmentation).

../_images/changepoint_detection.png

Fig. 4.28 Change point detection using CUSUM. Panel 1 shows response times with a mean shift around day 45 and associated spikes. Panel 2 shows the CUSUM statistic triggering an alarm near the shift.#